Taylor Wolfenberger (tw25848)

Introduction

The two datasets I have chosen describe various natality measures and infant mortality rates for two races, namely, black and white. The two datasets converge from the years 1960-2012. The first dataset, ‘BWNatality’, consists of the variables “Year”, “Race”, “Live Births”, “Birth Rate”, and “Fertility Rate”. Fertility rate is defined as the number of births per 1000 women of reproductive age (15-44), and birth rate is the number of live births per 1000 women in the total population per year (without taking particular age groups into account) (Source: 3). The years of data collection range from 1960 to 2018. The second dataset, ‘InfMortality’, is comprised of the variables “Year”, “Race”, and “Infant Mortality Rate”. Infant Mortality Rate is the number of deaths under one year of age per 1000 live births in a given year. This dataset’s years of data collection range from 1915 to 2013. Both datasets were acquired from the Centers for Disease Control and Prevention, and they were collected by the National Center for Health Statistics (NCHS) (Source: 1,2). The United States requires that all births be accompanied by a birth certificate as documentation, and NCHS utilizes The National Vital Statistics System to publish statistical data from these birth certificates, as publication of birth data is mandated by federal law (Source: 3).

These datasets interest me in particular because I will be attending medical school this July with hopes of specializing in the field of Women’s Health. I have worked in a Neonatal Intensive Care Unit (NICU) as well as in a Women’s Health Clinic throughout college, and I have always felt especially connected to women and babies. In addition, I have taken part in research projects that aim to address disparities in reproductive care for women, such as a lack of access to contraception and how sexual health is impacted for disadvantaged groups. Due to these experiences, I am familiar with how drastically natality measures can differ across ethnicity, race, socioeconomic status, etc., and I anticipate that fertility rate and birth rate will be significantly higher for black women compared to white women. Statistically speaking, there are racial disparities that affect factors such as access to healthcare and contraceptive methods, which may put black women at a higher risk of becoming pregnant and/or experiencing complications in childbirth.

Initial Step: Loading Datasets & Joining

library(tidyverse)
library(dplyr)
#Reading in datasets to be used
InfMortality<-read.csv("~/Downloads/InfantMortalityByRace.csv")
BWNatality<-read.csv("~/Downloads/BWNatality.csv")
#Performing an inner_join based on Year and Race
fulldataset<-inner_join(BWNatality, InfMortality, by=c("Year", "Race"))
#Omitting NAs, just for good measure
fulldataset<-inner_join(BWNatality, InfMortality, by=c("Year", "Race"))%>%na.omit()

1. Tidying/Untidying

The datasets were already tidy when the project began. Below, I demonstrate use of the pivot_longer and pivot_wider functions on the full project dataset. A majority of the numeric variables in this project are natality measures of some sort; using pivot_longer, I rearranged all of these variables (Live Births, Birth Rate, Ferility Rate, Infant Mortality Rate) to go under one column called “Natality Measure.” I purposely left out Year and Race from the untidying by specifying only particular columns. All of the values for each of these measures then fell into a second column called “Value.” I then used pivot_wider to manipulate the data back into its original tidy form, with each “Natality Measure” variable having its own column. I also use pivot_longer and pivot_wider throughout the project in other sections.

#Pivoting the dataset longer by creating a single column called "Natality Measure"
longdata<-fulldataset%>%pivot_longer((3:6), names_to="Natality Measure", values_to="Value")
longdata%>%head()
#Pivoting the dataset wider to restore it back to normal (making it tidy again)
longdata%>%pivot_wider(names_from="Natality Measure", values_from="Value")%>%head()

2. Joining

Below, I used an inner_join to merge the two datasets together based on the common variables “Year” and “Race”; both datasets contained data for only Black and White races over the years. The dataset ‘BWNatality’ had 118 rows prior to the join, and ‘InfMortality’ had 198 rows. The inner_join resulted in 104 rows, so a total of 94 cases were dropped. The dropped cases are the result of the ‘InfMortality’ data going as far back as the year 1915 for both races, while the ‘BWNatality’ data was not collected until the year 1960 onward. For the purposes of this project, I wanted all of the years to match up so that there would be no missing data for any variables. This is precisely why I chose to use an inner_join, as I wanted to drop rows in any dataset that did not have a match. If I had used a full_join (for instance), the years 1915-1959 would be lacking data from the ‘BWNatality’ dataset, resulting in lots of NAs. Despite its advantages, the implication of using the inner_join is that the project analysis may not be as accurate as one would hope, as we are now excluding decades of infant mortality data that are otherwise available.

#Performing an inner_join based on Year and Race
fulldataset<-inner_join(BWNatality, InfMortality, by=c("Year","Race"))
#Omitting NAs, just for good measure
fulldataset<-inner_join(BWNatality, InfMortality, by=c("Year","Race"))%>%na.omit()
#Finding the number of rows after the join
nrow(fulldataset)
## [1] 104

3. Wrangling

Demonstrating Use of the 6 Core Dplyr Functions

##Mutate - Creating a New Variable, "Infant Deaths," as a function of Live Births and Infant Mortality Rate. Then arranging by descending number of Infant Deaths to observe how numbers have changed over time
fulldataset<-fulldataset%>%mutate(Infant.Deaths=((Live.Births/1000)*Infant.Mortality.Rate))%>%arrange(desc(Infant.Deaths))
head(fulldataset)
#Note: it is interesting to see how the number of infant deaths has decreased over time 

##Arrange - By descending infant mortality rate, to see which race and year has the highest rates
fulldataset%>%arrange(desc(Infant.Mortality.Rate))%>%head()
##Filter - For rows where the birth rate is less than the median birth rate; then finding the counts of each Race (to see whose birth rates primarily fall below the median)
fulldataset%>%filter(Birth.Rate<median(Birth.Rate))%>%head()
fulldataset%>%filter(Birth.Rate<median(Birth.Rate))%>%count(Race)
##Summarize - Finding the median fertility rate and median birth rate for both races (also using filter once again here)
fulldataset%>%filter(Race=="Black")%>%summarize(median(Fertility.Rate))
fulldataset%>%filter(Race=="White")%>%summarize(median(Fertility.Rate))
fulldataset%>%filter(Race=="Black")%>%summarize(median(Birth.Rate))
fulldataset%>%filter(Race=="White")%>%summarize(median(Birth.Rate))
##Select - Only want to view the 'Race' and 'Infant Mortality Rate' columns side-by-side; then arranged by descending Infant Mortality Rate
fulldataset%>%select(Race,Infant.Mortality.Rate)%>%arrange(desc(Infant.Mortality.Rate))%>%head()
##Group_By - Grouping by 'Race' and then summarizing to find the mean number of live births
fulldataset%>%group_by(Race)%>%summarize(mean(Live.Births))

Generating Summary Statistics

###OVERALL summary statistics for each numeric variable: 
#Making Natality Measure names look a bit nicer first
overallstats<-fulldataset%>%rename("Infant Mortality Rate" = Infant.Mortality.Rate, "Birth Rate" = Birth.Rate, "Infant Deaths" = Infant.Deaths, "Live Births" = Live.Births, "Fertility Rate" = Fertility.Rate)
#Generating summary statistics (mean, median, min, max, and sd) for all numeric variables OVERALL
overallstats<-overallstats%>%summarize_at(c("Live Births","Birth Rate", "Fertility Rate", "Infant Mortality Rate", "Infant Deaths"), list("Mean"=mean, "Median"=median, "Minimum"=min, "Maximum"=max, "Standard Deviation"=sd), na.rm=T)%>%format(scientific=FALSE)
#Pivoting longer to put all Natality Measures under one column (and Summary Statistics under one column as well)
overallstats<-overallstats%>%pivot_longer(contains("_"))%>%separate(name,into=c("Natality Measure","Summary Statistic"), sep="_")
#Pivoting wider to give each Summary Statistic its own column - this will help with kable tables
overallstats<-overallstats%>%pivot_wider(names_from="Summary Statistic", values_from="value")%>%mutate_if(is.numeric, round, 3)

###Summary statistics after GROUPING by Race:
#Making Natality Measure names look a bit nicer
groupstats<-fulldataset%>%rename("Infant Mortality Rate" = Infant.Mortality.Rate, "Birth Rate" = Birth.Rate, "Infant Deaths" = Infant.Deaths, "Live Births" = Live.Births, "Fertility Rate" = Fertility.Rate)
#Generating summary statistics (mean, median, min, max, and sd) for numeric variables after grouping by Race
groupstats<-groupstats%>%group_by(Race)%>%summarize_at(c("Live Births","Birth Rate", "Fertility Rate", "Infant Mortality Rate", "Infant Deaths"), list("Mean"=mean, "Median"=median, "Minimum"=min, "Maximum"=max, "Standard Deviation"=sd), na.rm=T)
#Pivoting longer to put all Natality Measures under one column (and Summary Statistics under one column as well)
groupstats<-groupstats%>%pivot_longer(contains("_"))%>%separate(name,into=c("Natality Measure","Summary Statistic"), sep="_")
#Pivoting wider to give each Summary Statistic its own column
groupstats<-groupstats%>%pivot_wider(names_from="Summary Statistic", values_from="value")%>%mutate_if(is.numeric, round, 3)

###Making Kable tables 
library(knitr)
library(kableExtra)
#Creating table for overall summary statistics 
overallstats%>%kable(align="c")%>%kable_styling()
Natality Measure Mean Median Minimum Maximum Standard Deviation
Live Births 1877908 2559805 507162 3600864 1250023
Birth Rate 17.625 16.3 12 31.9 4.111829
Fertility Rate 78.25385 69.1 61.5 153.5 18.96815
Infant Mortality Rate 15.99327 14.25 5.1 44.3 9.320916
Infant Deaths 23391.3 18379.01 6911.973 82457.04 17640.63
#Creating table for summary statistics after grouping by race; table is colored by Race
groupstats%>%kable(align="c")%>%kable_styling("striped", full_width = F)%>%row_spec(1:5, bold = T, color = "white", background = "#FDB4AD")%>%  row_spec(6:10, bold = T, color = "white", background = "#B2CDE4")%>%column_spec(1, bold=T, color="black")
Race Natality Measure Mean Median Minimum Maximum Standard Deviation
Black Live Births 597610.200 599880.00 507162.000 684336.00 49144.693
Black Birth Rate 20.308 20.60 14.700 31.90 3.881
Black Fertility Rate 85.888 80.50 65.100 153.50 21.067
Black Infant Mortality Rate 21.472 18.70 10.900 44.30 9.257
Black Infant Deaths 12572.809 11403.54 6911.973 26680.29 4933.372
White Live Births 3063369.519 3092160.50 2551030.000 3600864.00 237108.156
White Birth Rate 15.141 14.50 12.000 22.70 2.419
White Fertility Rate 71.185 65.05 61.500 113.20 13.527
White Infant Mortality Rate 10.920 8.65 5.100 22.90 5.923
White Infant Deaths 33408.412 26313.12 15227.361 82457.04 19225.423
###Creating a correlation matrix of all numeric variables
cormatrix<-fulldataset%>%na.omit%>%select_if(is.numeric)%>%select(-Year)
cor(cormatrix)
##                       Live.Births  Birth.Rate Fertility.Rate
## Live.Births             1.0000000 -0.60401501     -0.3568016
## Birth.Rate             -0.6040150  1.00000000      0.9276752
## Fertility.Rate         -0.3568016  0.92767517      1.0000000
## Infant.Mortality.Rate  -0.5756488  0.93198833      0.9205023
## Infant.Deaths           0.6017033  0.04976877      0.2811915
##                       Infant.Mortality.Rate Infant.Deaths
## Live.Births                      -0.5756488    0.60170332
## Birth.Rate                        0.9319883    0.04976877
## Fertility.Rate                    0.9205023    0.28119153
## Infant.Mortality.Rate             1.0000000    0.14236235
## Infant.Deaths                     0.1423623    1.00000000

Using “mutate”, a new variable called “Infant.Deaths” was created. By definition, Infant Mortality Rate is the number of deaths per 1,000 live births under one year of age. Therefore, in one year, the number of Infant Deaths = (Live.Births/1000) * Infant Mortality Rate. This new “Infant.Deaths” variable consists of the infants who were considered live births but did not survive to one year of age. The “arrange” function was used to arrange the data by descending infant mortality rate; it was found that the year 1960 had the greatest infant mortality rate, with the trend being that infant mortality rate has decreased over time. “Filter” was used to pick out the rows whose birth rate was less than the median birth rate; a large majority of these observations were white women (43/51). “Summarize” was used to find the median fertility rate and birth rate for each race; both were lower for Whites. “Select” was utilized to view only the ‘Race’ and ‘Infant Mortality Rate’ columns side-by-side; black women had the maximum infant mortality rate. Lastly, “Group_by” made groups based on ‘Race’, and “summarize” was used in conjunction to find the the mean number of live births based on race. It was discovered that white women had a much higher number of live births on average.

Regarding the summary statistics: The mean, median, minimum, maximum, and standard deviation were calculated for all numeric variables both overall and after grouping by Race. As anticipated, the mean, median, minimum, maximum, and standard deviation of both birth rates and fertility rates are higher for black women as compared to white women. However, white women have nearly 5x as many live births (on average) as compared to black women, which also goes hand-in-hand with a greater possible number of infant deaths. Overall, white women have a lower mean, median, minimum, maximum, and standard deviation for infant mortality rate. Though there are likely many factors that play into this, these disparities could very well be the result of socioeconomic status differences. For instance, geographical proximity to higher-quality hospital facilities for white women might result in lower infant mortality rates as well as a greater likelihood for having a live birth. Both the minimum and maximum infant mortality rates for blacks are two times that of whites. As for the correlation matrix, negative correlations exist between birth rate and number of live births, fertility rate and number of live births, and infant mortality rate and number of live births. All other combinations of variables had positive correlations. Correlation results are discussed more thoroughly in the following section.

4. Visualization

#In the Wrangling section of the project (above), I already created an object called 'cormatrix' that contains only the numeric variables
###Creating a correlation heatmap of numeric variables - Plot 1
fulldatasetcor<-cor(cormatrix)%>%as.data.frame%>%
  rownames_to_column%>%
  pivot_longer(-1,names_to="name",values_to="correlation")
head(fulldatasetcor)
fulldatasetcor%>%ggplot(aes(rowname,name,fill=correlation))+
geom_tile()+
scale_fill_gradient2(low="red",mid="white",high="green")+
geom_text(aes(label=round(correlation,2)),color = "black", size = 4)+ theme(axis.text.x = element_text(angle = 90, hjust=1))+coord_fixed()+ggtitle("Correlation Heatmap of Numeric Variables")+ylab("Variables") + xlab("Variables")

Plot 1 - Correlation Heatmap of Numeric Variables

In this correlation heatmap, negative correlations are shown in shades of red, and positive correlations are shown in shades of green. The white squares indicate nearly-neutral values. There are very strong (>0.90) positive correlations between fertility rate and birth rate, infant mortality rate and birth rate, and fertility rate and infant mortality rate. This makes sense, as a higher fertility rate will likely lead to a greater birth rate in turn (higher likelihood of pregnancy = more babies being born)! There is a medium-strength (0.6) positive correlation between live births and infant deaths; since there is a greater number of babies being born, it follows that there is an increased number of infant deaths before 1 year of life as well. There are weak (0.05-0.28) positive correlations between infant deaths and fertility rate, infant deaths and infant mortality rate, and infant deaths and birth rate. In contrast, there are negative correlations between live births and birth rate, fertility rate and live births, and infant mortality rate and live births. Though this is purely speculation, it could be that as birth rate increases, the number of live births decreases, as there may be a shortage of resources or hospital staff to adequately assist at every birth.

###Plot 2 - Fertility Rate and Birth Rate by Race Over Time
#Selecting only Fertility Rate and Birth Rate variables, grouped by Year and Race - also renaming them to look nicer on the graph
fertilityandbirth<-fulldataset%>%group_by(Year, Race)%>%select(Fertility.Rate,Birth.Rate)%>%rename("Fertility Rate" = Fertility.Rate, "Birth Rate" = Birth.Rate)
#Pivoting longer to put Fertility Rate and Birth Rate under a single column, "Natality Measure"
fertilityandbirth<-fertilityandbirth%>%pivot_longer(cols = 3:4, names_to="NatalityMeasure", values_to="Value")
#Making the ggplot; changing the theme and its color scheme; changing the distribution of tick marks on the x-axis
ggplot(fertilityandbirth, aes(Year, Value, color=Race))+geom_point(aes(shape=NatalityMeasure))+ggtitle("Fertility Rate and Birth Rate by Race Over Time (1960-2012)")+scale_x_continuous(breaks=seq(1960, 2020, 10))+ylab("Rate (Per 1000 Women)") + xlab("Year")+theme_light()+scale_color_brewer(palette = "Pastel1")+scale_shape_discrete(name = "Natality Measure", labels = c("Birth Rate", "Fertility Rate"))

Plot 2 - Fertility Rate and Birth Rate by Race Over Time

This graph depicts changes in birth rate and fertility rate from 1960 to 2012, for both black and white races. The most interesting trend here is that both rates for black and white women seem to be converging more and more over time, as the gap is visibly decreasing. It is encouraging to see this, as it suggests that potentially we are on track towards eradicating racial disparities in reproductive health outcomes. Birth rate has steadily declined since the 1960s for both races, and fertility rate has declined drastically, though it has fluctuated much more throughout the years. It was actually during the 1960s (in the Griswold v. Connecticut case) that the government gave married couples the right to use birth control, and in the 1970s, birth control was legalized for all U.S. citizens (Source: 4). Perhaps this newfound right for women to be in control of their reproductive health explains why we see the starkest decline in both birth rate and fertility rate during this time period, though there are likely several other compounding factors as well.

###Plot 3 - Bar Plot, Mean Birth Rate and Infant Mortality Rate by Race
#Selecting only Birth Rate and Infant Mortality Rate variables, grouped by Race
fulldataset2<-fulldataset%>%group_by(Race)%>%select(Infant.Mortality.Rate,Birth.Rate)%>%rename("Infant Mortality Rate" = Infant.Mortality.Rate, "Birth Rate" = Birth.Rate)
#Pivoting longer to put Birth Rate and Infant Mortality Rate under a single column, "Natality Measure"
fulldataset3<-fulldataset2%>%pivot_longer(cols = 2:3, names_to="NatalityMeasure", values_to="Value")
#Creating ggplot, taking the MEAN of both variables; changing the theme and color scheme
ggplot(fulldataset3, aes(x=NatalityMeasure, y=Value, fill=Race))+ geom_bar(stat="summary",fun.y=
"mean", position="dodge", width=0.8)+ ggtitle("Mean Birth Rate and Infant Mortality Rate by Race")+ylab("Mean") + xlab("Natality Measure")+theme_bw()+scale_fill_brewer(palette = "Pastel1")+geom_errorbar(stat="summary",aes(y=Value), fun.data='mean_se',width=.8,position="dodge")

Plot 3 - Mean Birth Rate and Infant Mortality Rate by Race

This graph depicts the differences in mean birth rate and mean infant mortality rate for both black and white races. It is evident that the means of both rates are quite a bit lower for white women. The mean infant mortality rate for white women is about half that of black women, which is surprising considering that there is less of a gap in their birth rates. It is evident that white women have a lesser chance of losing a child in its first year of life, whether that be due to access to resources or other factors. The correlation heatmap (Plot 1) showed a strong positive correlation (0.93) between birth rate and infant mortality rate, so it makes sense that the high birth rate for black women would be matched by a high infant mortality rate as well. Standard error bars are shown as well; it is evident that there is greater variability in the Infant Mortality Rate measure. This bar plot easily enables one to visualize the differences in means, which further depicts racial disparities in reproductive outcomes.

5. Dimensionality Reduction

PAM Clustering

library(cluster)
#Creating a 'clustdata' object that contains only the variables Race, Birth Rate, Fertility Rate, and Infant Mortality Rate
clustdata<-fulldataset%>%select(Race,Birth.Rate,Fertility.Rate,Infant.Mortality.Rate)
#Determining the most ideal number of clusters to use with silhouette width function - The result is k=3
sil_width<-vector()
for(i in 2:10){
  pam_fit <- pam(clustdata, k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
  }
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)

#Selecting the k=3 points at random to serve as initial centers; scaling numerics
pam1 <- clustdata %>% mutate_if(is.numeric, scale) %>% pam(k=3)
head(pam1)
## $medoids
##      Race Birth.Rate Fertility.Rate Infant.Mortality.Rate
## [1,]   NA  1.8665660       1.958343             1.8567629
## [2,]   NA  0.7721625       0.197497             0.5156930
## [3,]   NA -0.6384021      -0.709286            -0.6108058
## 
## $id.med
## [1] 43 77 22
## 
## $clustering
##   [1] 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 1 3 1 3
##  [38] 3 3 1 3 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [ reached getOption("max.print") -- omitted 4 entries ]
## 
## $objective
##     build      swap 
## 0.7719345 0.6791504 
## 
## $isolation
##  1  2  3 
## no no no 
## Levels: no L L*
## 
## $clusinfo
##      size max_diss   av_diss diameter separation
## [1,]   12 3.266753 1.2190254 4.590333  0.3687929
## [2,]   31 1.572468 0.7017242 2.183142  0.3687929
## [3,]   61 1.167642 0.5614736 2.107822  0.3915206
#Saving cluster assignments to a 'pamclust' dataset
pamclust<-clustdata %>% mutate(cluster=as.factor(pam1$clustering))
#Creating the ggplot - depicts the visualization of only 2 variables, Birth Rate and Fertility Rate, but in clusters based on the 3 selected variables
pamclust %>% ggplot(aes(Birth.Rate,Fertility.Rate,color=cluster, shape=Race)) + geom_point()

This plot is a bit less informative than the three-dimensional one displayed below, as it depicts the visualization of only 2 variables, Birth Rate and Fertility Rate. However, it is still in clusters based on Birth Rate, Fertility Rate, and Infant Mortality Rate, and it is a bit easier to look at as compared to the 3D plot.

#Summarizing clusters; finding each of their means for the 3 variables
pamclust %>% group_by(cluster) %>% summarize_if(is.numeric,mean,na.rm=T)
library(plotly)
#Creating three-dimensional plot based on the 3 distinct variables (also changing the symbols based on Race)
pamclust%>%plot_ly(x= ~Birth.Rate,  y = ~Fertility.Rate, z = ~Infant.Mortality.Rate, color= ~cluster,
        type = "scatter3d", mode = "markers", symbol = ~Race, symbols = c('circle','x'))%>%layout(autosize = F, width = 800, height = 500)
#Visualizing all pairwise combinations of the 3 variables; colors denote the different clusters, and shapes differ by Race 
library(GGally)
ggpairs(pamclust, columns=2:4, aes(color=cluster, shape=Race))

#Interpreting average silhouette width; result indicates that the structure is reasonable
plot(pam1,which=2)

#Silhouette widths for each cluster - these widths are also shown in the plot above (on the right side of the plot)
pam1$silinfo$clus.avg.widths
## [1] 0.3548264 0.5546103 0.6644349
#Finding the proportions of Black/White Races in each cluster
clustdata%>%mutate(cluster=pam1$clustering)%>%group_by(cluster)%>%
  filter(!is.na(Race))%>%count(Race)%>%mutate(prop=n/sum(n))%>%
  pivot_wider(-n,names_from=Race,values_from=prop,values_fill = list(prop=0))
#Proportions of cluster assignment (by Race)
clustdata%>%mutate(cluster=pam1$clustering)%>%
  group_by(Race)%>%count(cluster)%>%mutate(prop=n/sum(n))%>%
  pivot_wider(-n,names_from=Race,values_from=prop,values_fill = list(prop=0))

Using PAM, or Partioning Around Medoids, I created clusters based on the numeric variables Infant Mortality Rate, Fertility Rate, and Birth Rate. The individual observations were also distinguished by Race. The output resulted in 3 distinct clusters that represent different “levels” of the three aforementioned numeric variables. More specifically, Cluster 1 consists of outstandingly high birth rates, fertility rates, and infant mortality rates; Cluster 2 is comprised of intermediate rates for all 3 variables, and Cluster 3 consists of all low rates. A silhouette width function was used in order to access how well the points were assigned to their clusters. With an average silhouette width of 0.6, it can be stated that a reasonable structure has been found. Clusters 2 and 3 are relatively cohesive; however, Cluster 1 is a bit spaced out. This is also indicated by the silhouette width for Cluster 1 being 0.355, which is pretty weak compared to Clusters 2 and 3, whose silhouette widths are 0.555 and 0.664, respectively. This represents the fact that there is still a lot of variation among observations in the high birth rate/fertility rate/infant mortality rate cluster.

All 3 variables are strongly positively correlated with one another (>0.92 for all combinations), and all distinguish the clusters quite well. Interestingly, Cluster 1 is almost entirely comprised of black women, who make up 83.33% of this high birth rate, fertility rate, and infant mortality rate cluster. In contrast, Cluster 3 (with low birth, fertility, and infant mortality rates) is comprised primarily of white women (70.5%). Cluster 2 is intermediate in its rates, and it is composed of 71% black women and 29% white women. Out of all observations for white women, 79.6% lie in Cluster 3, while black women are more evenly spread across all 3 clusters.

Note: I originally tried to cluster based on ALL numeric variables in the project; however, the clusters were not well-distinguished at all. The output was much more informative when only 3 numeric variables were considered: Birth Rate, Infant Mortality Rate, and Fertility Rate.

Works Cited

Important sources that were utilized throughout the project are as follows:

  1. ‘BWNatality’ dataset obtained from: https://data.cdc.gov/NCHS/NCHS-Natality-Measures-for-Females-by-Race-and-His/89yk-m38d/data

  2. ‘InfMortality’ dataset obtained from: https://data.cdc.gov/NCHS/NCHS-Infant-Mortality-Rates-by-Race-United-States-/ddsk-zebd

  3. NCHS information and definitions for natality measures from: https://www.cdc.gov/nchs/nvss/births.htm?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fnchs%2Fbirths.htm

  4. Information on the history of birth control legalization in the U.S.: https://www.ourbodiesourselves.org/book-excerpts/health-article/a-brief-history-of-birth-control/